BarWare: efficient software tools for barcoded single-cell genomics

Background Barcode-based multiplexing methods can be used to increase throughput and reduce batch effects in large single-cell genomics studies. Despite advantages in flexibility of sample collection and scale, there are additional complications in the data deconvolution steps required to assign each cell to their originating samples. Results To meet computational needs for efficient sample deconvolution, we developed the tools BarCounter and BarMixer that compute barcode counts and deconvolute mixed single-cell data into sample-specific files, respectively. Together, these tools are implemented as the BarWare pipeline to support demultiplexing from large sequencing projects with many wells of hashed 10x Genomics scRNA-seq data. Conclusions BarWare is a modular set of tools linked by shell scripting: BarCounter, a computationally efficient barcode sequence quantification tool implemented in C; and BarMixer, an R package for identification of barcoded populations, merging barcoded data from multiple wells, and quality-control reporting related to scRNA-seq data. These tools and a self-contained implementation of the pipeline are freely available for non-commercial use at https://github.com/AllenInstitute/BarWare-pipeline. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04620-2.

microfluidic chips, eliminating common sources of technical variation and mitigating the risk of sample dropout due to loss of any single well (Fig. 1A). We have recently implemented this approach at scale for multimodal immunomonitoring of patient samples to study the immune system [5].
With the advantages of Cell Hashing come additional complications related to data processing: samples are no longer directly associated with a single set of well indices and must be demultiplexed at two levels, both by well and by sample, for downstream analysis (Fig. 1B). Previously published tools for barcode quantification were very flexible but slow and did not include the ability to easily repartition count matrices for each sample. To implement Cell Hashing at scale, we developed BarWare, an efficient and comprehensive pipeline consisting of two modular tools linked by shell scripting: BarCounter, for fast, efficient tabulation of HTO counts per cell barcode, and BarMixer, an R package that provides code to quickly redistribute samples across wells and report results and quality control (QC) metrics in user-friendly reproducible RMarkdown reports.
We show that BarCounter outperforms other Hash Tag Oligo (HTO) counting tools and demonstrate the BarWare Cell Hashing pipeline using a large benchmark dataset generated by progressive overloading of the 10x Chromium v3 3′ RNA-seq assay. These capabilities, combined with an emphasis on automated quality control reporting, make BarWare a scalable, user-friendly, and comprehensive toolkit for Cell Hashing that can be efficiently applied to large-scale sequencing projects with many wells of 10x 3′ RNA-seq data.

Efficient barcode counting with BarCounter
We identified HTO counting as a significant bottleneck in the processing of Cell Hashing data. In particular we found that a popular and widely used tool, CITE-seq Count [15], scaled poorly to highly overloaded wells both in terms of processing time and memory. As the cost of single-cell sequencing continues to decline, large Cell Hashing and CITEseq experiments on the order of hundreds of thousands to millions of cells are being generated. To facilitate rapid and parallel processing of large datasets we developed Bar-Counter: a fast, scalable HTO counting program implemented in C and optimized with speed and memory use in mind. Briefly, BarCounter parses paired-end FASTQ data into cell barcode, Unique Molecular Identifier (UMI), and hashtag sequences, then matches barcodes and hashtags against a user-provided cell barcode whitelist and hashtag sequence list, respectively. To account for sequencing errors, BarCounter allows a single base mismatch in hashtag sequences and a single low quality basecall (Q < 20) mismatch in cell barcodes. BarCounter processes each read independently and utilizes the trie data structure (also known as a prefix tree) to perform cell barcode and UMI lookups in constant time ( Fig. 2A-C).

Assignment of counts to hashed populations
The BarMixer package includes tools to convert raw HTO counts from BarCounter into assignments of each cell to their sample of origin. BarMixer assigns barcodes as "singlet", "doublet", "multiplet", or "no hash" based on dynamically determined UMI cutoffs specific to each hash sequence in each well. For each hashtag, a distribution of HTO counts across all cell barcodes is generated, and a cutoff value delineating positive and negative barcodes is assigned ( Fig. 2G and "Methods"). Barcode categories are determined based on the number of positive hashes, e.g. cell barcodes positive for a single hash are classified as singlets. Barcodes are labelled with sample names corresponding to each At runtime, the user provides a barcode whitelist which is loaded into a trie for rapid lookups, a taglist containing all valid hashtag sequences, and paired Read 1 and Read 2 FASTQ files. For each read, checks are performed to verify the cell barcode exists in the barcode trie, and the hashtag sequence is in the taglist. The UMI sequence is checked against a trie and if it is not present, the trie is updated and the counts for the barcode and hashtag combination are incremented. D-F Benchmarking comparisons of BarCounter and other available HTO counting algorithms as a function of increasing cell loading per 10x Genomics well: cellranger count (10x Genomics); CITE-seq-Count (with or without barcode correction, [15], KITE, [4] (single-threaded or with 8 threads). D Maximum memory usage, E Average CPU load, F Elapsed time. G Overview of the barcode cutoff determination method used by BarMixer: Raw counts generated by BarCounter are clipped to remove low values, then log transformed, and used as input to 2-cluster K-means. If cluster medians are separable, the cutoff is set to the lowest value in the positive cluster. Note broken y-axis in the first two panels.

Distribution of cells with BarMixer
Sample-specific datasets are prepared via BarMixer by performing three key steps. For each well, BarMixer annotates Cell Ranger filtered HDF5 files with QC characteristics and cell metadata. Then, BarMixer uses the sample assignments for each cell to split data into separate HDF5 files by sample. Finally, BarMixer merges data across all processed wells based on the sample assignments. This yields a separate, merged HDF5 file for each sample, a merged HDF5 file for all multiplets, and metric reports in JSON and HTML format. Reports include relevant sequencing QC metrics, alignment distributions by barcode category, UMI and gene count distributions by hashtag, and median count data by both sample and well.

Progressive cell overloading to assess demultiplexing
We evaluated the BarWare pipeline and related tools by conducting a progressive cell overloading experiment (Fig. 3). We used fluorescence activated cell sorting (FACS) to separate a sample of peripheral blood mononuclear cells (PBMCs) into naïve T cells, memory T cells, and non-T cell PBMC populations. Each sorted population was divided into two technical replicates for a total of six samples that were stained with commercially available HTO antibodies (BioLegend TotalSeq-A). The six samples were pooled together and loaded into six wells of a 10x Chromium v3 3′ RNA-seq chip at inputs of 16k, 24k, 32k, 48k, 64k, and 80k cells per well (Fig. 3A). Sequencing depth was scaled linearly with predicted cell recovery by well. Hashtag read counts ranged from approximately 40M for 16,000 cell loading to ~ 163M for 80,000 cell loading (Additional file 3: Table S3). This dataset provides a unique test case for HTO counting that is applicable across a wide range of cell numbers and read counts.

Comparison of BarCounter to HTO counting tools
We compared BarCounter to other popular software tools for HTO counting, including CITE-seq-Count (with and without optional UMI correction), Cell Ranger count (10x Genomics), and kallisto indexing and tag extraction (KITE) in both single and multithreaded modes [4]. Some of these methods perform computationally costly Unique Molecular Identifier (UMI) correction because sequencing errors may artificially inflate UMI counts and distort the data. This correction is important for rare transcripts or markers, but commercially available HTO barcode sequences have a universal minimum hamming distance of three bases to ameliorate the risk of hashtag misidentification.
To evaluate the accuracy of BarCounter compared to a method including UMI correction, we ran BarMixer (described below) with HTO counts from either BarCounter or CITE-seq count with UMI correction and compared overlap in barcode classification and sample identification. For each of the six mixed wells, over 99.8% of barcodes identified as singlets were identical between the two methods (Additional file 1: Table S1). Across all wells, 113,414 barcodes were identified as singlets by BarCounter, only 60 of which were identified as doublets by CITE-seq Count. Counts for the top two hashtags for these barcodes differed between the methods by an average of 3.8% and 2.5% respectively, with the majority of barcodes having a count ratio between the top two hashtags greater than three, supporting their classification as singlets (Additional file 2: Table S2). All 113,268 barcodes identified as singlets by both tools had matching sample identity classifications. Therefore, the high dynamic range between positive (bound) and negative (unbound) HTO populations for each cell barcode enables hashtag analysis to be performed without computationally expensive UMI correction with little loss of accuracy in sample identification and doublet detection.   For each Cell Hashing well, we processed HTO FASTQ data using each tool and tracked performance using the Linux "time -v" command. For each well, BarCounter had the lowest memory usage (defined as maximum resident set size), lowest CPU usage, and lowest user (CPU) time. BarCounter was fastest in real (wall clock) time across all comparisons with the exception of the 64,000 and 80,000 cell wells, in which eight-threaded KITE processing was 7% and 15% faster, respectively (Fig. 2D-F). Due to the low-dimensional nature and inherent background signal of HTO data, we opted to output results in the universally readable comma separated values (CSV) format rather than a sparse matrix format. Despite this change, BarCounter outputs were the smallest in terms of disk space across all comparisons (Additional file 4: Table S4).
Based on these performance metrics, we estimate that data from an eight well experiment loading 16,000 cells and sequencing to a depth of 40M reads (~ 2500 reads per cell in this experiment) per well could be processed in parallel on a modest 8 CPU, 20 GB RAM computer in less than five minutes. These results demonstrate that BarCounter is ideally suited for the parallel processing of large Cell Hashing datasets, including when well number, cell recovery, and sequencing depth are high.

Separation of sample data using BarMixer
We developed a second tool to apply Cell Hashing to samples distributed across multiple wells (Fig. 1B). BarMixer is an R package and set of Rmarkdown notebooks that enables separation of samples within each well (splitting by hash) and reassembly of each sample across all wells into sample-specific output files (merging by hash). First, HTO counts generated by BarCounter are processed to identify a threshold value for each HTO barcode, assign each cell barcode to its corresponding sample(s) as a singlet, doublet, or multiplet (Fig. 2G, "Methods"), and generate an HTML-based report for HTO category counts and cutoffs (including Fig. 2H). Then, HDF5-formatted count matrix results from Cell Ranger count (10x Genomics) are preprocessed to add multiple points of cell metadata, assign each cell barcode with a universally unique identifier (UUIDs) to avoid cell barcode conflicts between wells, and generate a QC report for each well. Next, the HTO category and count data, as well as the metadata-tagged HDF5 count matrix file are used to split each well, create separate HDF5 files for each sample and a separate file for multiplets, and generate a report of sample metadata for each well (including RNA-seq read usage as displayed in Fig. 2I). Finally, the results from each sample across all wells are merged into a single output file per sample, and a final report summarizing HTO categories and RNA-seq QC characteristics is generated for all wells (including Fig. 2J-L). This modular series of steps and reporting allows for rapid assessment of results using the final summary report, as well as step-by-step troubleshooting of each major process in the sample demultiplexing pipeline.

Evaluation of sample assignments
To evaluate the fidelity of BarWare's sample assignments, we utilized the Barware pipeline to process a progressive cell overloading dataset, then performed analysis using Seurat v4 [8] to confirm that the samples identified via Cell Hashing and the original FACS sorted populations were in agreement (Fig. 3).
Following simple QC filtering, we performed dimensionality reduction, clustering, and visualized the results using uniform manifold approximation and projection (UMAP). We observed that cells from all eight wells were mixed evenly, there was complete overlap between technical replicates, and that each sorted FACS population showed a high degree of separation from the others (Fig. 3C, E). We then mapped our data onto the reference PBMC CITE-seq dataset described in [8], and transferred the reference cell type labels to our dataset (Fig. 3D, F). As expected, cells with hashes from the non-T cell FACS population were assigned to non-T cell identities (32,069 of 32,199 cells; 99.6%), memory T cell sorts were assigned memory T cell type identities (35,101 of 36,932 cells; 95%), and naïve T cell hashes were most frequently assigned to naïve T cell and double-negative T cell (dnT) identities (30,706 of 36,264 cells; 84.7%) (Fig. 3D-F).
We also visualized cell type-specific marker genes on our UMAP and found highly specific gene expression patterns that support the labelled cell type identities. Expression of CD3D was restricted to the naïve and memory T cell populations, CD14 and MS4A1 (CD20) expression identified classical monocytes and B cells, respectively in the non T cell population, and GNLY was specific to labelled NK and CD8 T cells (Fig. 3E). We divided the T cell compartment into naïve and memory based on expression of CCR7 or S100A4 [7], respectively (Fig. 3F), and found their expression to be mutually exclusive and constrained to the expected FACS populations.
Taken together, the gene expression results show agreement between sample assignments from FACS and Cell Hashing, and confirm that BarWare demultiplexes mixed samples to a high degree of accuracy.

Conclusion
We have demonstrated the advantages and efficiency of BarWare through its application to a large, multi-well Cell Hashing experiment representing a broad range of cell overloading. BarCounter outperformed other HTO counting tools in terms of speed and computing resources with no decrease in accuracy. BarMixer performs barcode demultiplexing and provides thorough reports detailing QC metrics, and produces merged, sample-specific analysis ready data files along with reports describing the results by sample, by well, and by batch.
In addition, our cell overloading dataset demonstrated the utility of BarWare outputs in simplifying downstream analysis of complex experiments. Merged outputs reduce the number of output files and eliminate manual separation of samples, while maintaining experimental metadata such as the original 10x well, identified hashtag, and barcode classification. BarMixer's split and merge approach allows analysis of separate samples, independent of the cell pooling performed at the bench, which we have utilized to enable scalable multimodal immunosurveilance studies [5]. We expect this feature to become increasingly beneficial as other research institutes and large consortia scale single-cell data generation to the order of tens of millions of cells. Finally, the cell overloading dataset provided with these tools should be useful in the development of new methods for rapid sample demultiplexing.
BarWare provides a comprehensive set of tools which lowers the barrier to entry of Cell Hashing workflows for small laboratories in the field of single-cell sequencing, and should be useful for core facilities that can use cell hashing to mix and overload samples to increase throughput and allow their customers to use only a fraction of one or many wells.

Sample processing
Biological specimens were purchased from Bloodworks Northwest as freshly drawn whole blood. All sample collections were conducted by Bloodworks Northwest under IRB-approved protocols, and all donors signed informed consent forms. PBMCs were isolated in-house using Ficoll Premium (GE Healthcare, 17-5442-03), were cryopreserved using Cryostor10 (StemCell Technologies, 07930), and stored in liquid nitrogen until use. PBMCs were thawed at 37 °C using AIM V medium (Gibco, 12055091).

10x library preparation
Libraries were prepared using the Chromium Single Cell 3′ v3 reagent kit (10x Genomics, 1000075) following the 10x Genomics User Guide (CG000183 Rev A), with the only modification being cell overloading. All libraries were sequenced on an Illumina NovaSeq S4 flowcell. Target read counts were 30,000 reads per cell for RNA libraries and 2,000 reads per cell for HTO libraries.

HTO counting
Hashtag counting was profiled using the Linux "time -v" command (GNU time v1.

HTO category assignment
Cells were assigned to individual HTO-defined samples, doublet, multiplet, or no hash categories using a multi-step process contained in the BarMixer package for R, all of which are performed in sequence for a given well using the hto_processing.Rmd script provided in BarMixer. The matrix of HTO counts per cell barcode is read from BarCounter outputs, and cutoffs for positive or negative cell barcodes are defined for each HTO separately. Cutoffs are determined by removing all counts below 10. Then, a test of unimodality is performed using the modetest function from the multimode package for R (v1.4 [1], parameters: method = "HH" and B = 20) to use the Dip Test of Unimodality [9] with 20 replicates. If the distribution is unimodal, the cutoff is set to the mean value plus 2 times the standard deviation of log-transformed values. This allows capture of some positive hashes when the distribution of hashes is not bimodal, though clear bimodal separation is ideal. If the distribution of counts is not unimodal, the values are log-transformed, and 2 center K-means clustering is performed using the base R kmeans function. Cluster centers are then compared to determine if the higher center is more than fourfold greater than the lower center. If so, the cutoff is set to the minimum value in the higher cluster. Otherwise, the cutoff is set to the maximum value of all cell barcodes, and no barcodes are considered passing. After setting a cutoff for each HTO, cell counts are converted to a binary matrix of passing (1 = greater than or equal to the cutoff ) or failing (0 = less than the cutoff ) values, and the number of passing values are counted for each cell barcode. Cells with a single passing value are assigned to "singlets", two passing values to "doublets", more than two passing values to "multiplets", and no passing values to "no hash" categories. This information is used to generate a table of hashing categories and the HTO barcode(s) assigned to each cell barcode.

Splitting and merging data by sample
After performing HTO category assignment for each well, a second script in the Bar-Mixer package, split_h5_by_hash.Rmd, is used to split singlet cells from each sample and from non-singlet categories. This script reads both the HTO category assignment table generated above for each well and the HDF5-formatted count matrix generated by Cell Ranger (10x Genomics). For each well, this script generates a separate HDF5 file for each sample per well. After performing this split step for each well in the experiment, a third script from BarMixer, merge_h5_by_hash.Rmd, assembles the HDF5 files for each sample across all wells into a single HDF5 output, and uses the combined information from these files to generate a comprehensive QC report for data from all wells. All steps for category assignment, splitting, and merging can be performed using wrapper script provided in the BarWare-pipeline repository, 02_ run_BarMixer.sh, available at https:// github. com/ Allen Insti tute/ BarWa re-pipel ine.

RNA-seq analysis
Merged HDF5 files from the final step of the BarWare pipeline were used as input and analyzed using Seurat (v4.0.3 [8]). Singlet data was read using the BarMixer (v1.2.0) function read_h5_seurat and merged into a single Seurat Object . Low quality barcodes and extreme outliers were filtered out by subsetting barcodes with less than 25% mitochondrial counts, RNA UMI counts of at least 1000 and less than 25,000, and at least 500 genes detected. We normalized the data using the Seurat function SCTransform [6], performed dimensionality reduction using the RunPCA function, generated a two-dimensional UMAP projection from the first 50 principal components using the RunUMAP function, and clustered the cells using the first 50 principal components using the FindNeighbors and FindClusters functions. We mapped our dataset to a reference PBMC CITE-seq dataset from [8] using the FindTransfer-Anchors function (parameters: dims = 1:50) and transferred cell type labels from the reference to our dataset using the MapQuery function.
Comparison of barcode classifications between HTO counting tools was performed using Python (v3.7.3) and the Pandas module [14].